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Abstract 

This work investigates how the traditional image classification pipelines can be extended 
into a deep architecture, inspired by recent successes of deep neural networks. We pro¬ 
pose a deep boosting framework based on layer-by-layer joint feature boosting and dic¬ 
tionary learning. In each layer, we construct a dictionary of filters by combining the 
filters from the lower layer, and iteratively optimize the image representation with a 
joint discriminative-generative formulation, i.e. minimization of empirical classification 
error plus regularization of analysis image generation over training images. For optimiza¬ 
tion, we perform two iterating steps: i) to minimize the classification error, select the 
most discriminative features using the gentle adaboost algorithm; ii) according to the 
feature selection, update the filters to minimize the regularization on analysis image rep¬ 
resentation using the gradient descent method. Once the optimization is converged, we 
learn the higher layer representation in the same way. Our model delivers several distinct 
advantages. First, our layer-wise optimization provides the potential to build very deep 
architectures. Second, the generated image representation is compact and meaningful. 
In several visual recognition tasks, our framework outperforms existing state-of-the-art 
approaches. 

Keywords: Representation Learning, Compositional boosting, Dictionary learning, 
Image Classification 


1. Introduction 

Visual recognition is one of the most challenging domains in the field of computer 
vision and smart computing. Many complex image and video understanding systems 
employ visual recognition as the basic component for further analysis. Thus the design 
of robust visual recognition algorithm is becoming a fundamental engineering in computer 
vision literature and has been attracting many related researchers. Since the inadequate 
visual representation will greatly influence the performance of visual recognition system, 
almost all of the related methods are concentrated on developing the effective visual 
representation. 
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Traditional visual recognition systems always adopt the shallow model to construct 
the image/video representation. Among them, the bag-of-visual-words (BoW) model, 
which is the most successful one for visual content representation, has been widely 
adopted in many computer vision tasks, such as object recognition 0,0 and image 
classification [3, 4|. The basic pipeline of BoW model consists of local feature extrac¬ 
tion ii, feature encoding jj, Q, [8] and pooling operation. In order to improve the 
performance of BoW, two crucial schemes have been involved. First, the traditional 
BoW model discards the spatial information of local descriptors, which seriously lim¬ 
ited the descriptive power of the feature representation. To overcome this problem, the 
Spatial Pyramid Matching method was proposed in [sj to capture geometrical relation¬ 
ships among local features. Second, dictionaries adopted to encode the local feature in 
traditional methods are learned in a unsupervised manner and can hardly capture the 
discriminative visual pattern for each category. This issue inspired a series of works 
hHUGI to train more discriminative dictionaries via supervised learning, which can 


be implemented by introducing the discriminative term into dictionary learning phase as 
the regularization according to various criteria. 

As the research going, the deep models, which can be seen as a type of hierarchical 
representation [Tillfl lllTo have played an significant role in computer vision and machine 
learning literature [16i . [17l, |18] in recent years. Generally, such hierarchical architecture 
represents different layer of vision primitives such as pixels, edges, object parts and 
so on 0. The basic principles of such deep models are concentrated on two folds: (1) 
layerwise learning philosophy, whose goal is to learn single layer of the model individually 
and stack them to form the final architecture; (2) feature combination rules, which aim at 
utilizing the combination (linear or nonlinear) of low layer detected features to construct 
the high layer impressive features by introducing the activation function. 

In this paper, the related exciting researches inspire us to explore how the tradi¬ 
tional image classification pipelines, which include feature encoding, spatial pyramid 
representation and salient pattern extraction (e.g., max spatial pooling operation), can 
be extended into a deep architecture. To this end, this paper proposes a novel deep 
boosting framework, which aims to construct the effective discriminative features for 
image classification task, jointly adopting feature boosting and dictionary learning. For 
each layer, followed the famous boosting principle [20], our proposed method sequentially 
selects the discriminative visual features to learn the strong classifier by minimizing em¬ 
pirical classification error. On the other hand, the analysis dictionary learning strategy 
is involved to make the selected features more suitable for the object category. A two- 
step learning process is investigated to iteratively optimize the objective function. In 
order to construct high-level discriminative representations, we composite the learned 
filters corresponding to selected features in the same layer, and feed the compositional 
results into next layer to build the higher-layer analysis dictionary. Another key to our 
approach is introducing the model compression strategy when constructing the analy¬ 
sis dictionary, that reduces the complexity of the feature space and shortens the model 
training time. The experiment shows that our method achieves excellent performance 
on general object recognition tasks. Fig. |T] illustrates the pipeline of our deep boosting 
method (applying two layers as the illustration). Compared with the traditional BoW 
based method |zj, the analysis operation in our model (i.e., convolution) is same as the 
encoding process that maps the image into the feature space. While the pooling stage is 
same as the traditional method to compute the histogram representation adopting spatial 






pyramid matching. Different from traditional models capturing the salient properties of 
visual patterns by max spatial pooling operation, we adopt the feature boosting to the 
discriminative features mining for image representation. 
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Figure 1: A two-layer illustration of proposed deep boosting framework. The horizontal 
pipelines show the layer-wised image representation via joint feature boosting and analy¬ 
sis dictionary learning. When optimization in the single layer is done, the compositional 
filters are fed into the higher-layer to generate the novel analysis dictionary for further 
processing. Note that the feature set in the higher-layer only dependents on the training 
images and combined filters in the relevant layer. 


The main contributions of this paper are three folds. (1) A novel deep boosting 
framework is proposed and it leverages the generative and discriminative feature repre¬ 
sentation. (2) It presents a novel formulation which jointly adopting feature boosting and 
analysis dictionary learning for image representation. (3) In the experiment on several 
standard benchmarks, it shows that the learned image representation well discovers the 
discriminative features and achieves the good performance on various object recognition 
tasks. 

The rest of the paper is organized as follows. Sec. [2] presents a brief review of related 
work, followed by the overview of background technique details in Sec. [3] Then we 
introduce our deep boosting framework in Sec. Q1 Sec. [5] gives the experimental results 
and comparisons. Sec. [G] concludes the paper. 


2. Related Work 


In the past few decades, many works have been done to design different kinds of fea¬ 
tures to express the characteristics of the image for further visual tasks. These hand-craft 
features vary from global expressions 211 to the local representation @. Such designed 
features can be roughly divided into two types 22], the one is geometric features and 
the other is texture features. Geometric features which explicitly record the locations 
of edges are employed to describe the noticeable structures of local areas. Such features 
include Canny edge descriptor [23|, Gabor-like primitives 24 and shape context descrip¬ 
tor [ 25 ], |2^| . In contrast, the texture features express the cluttered object appearance 










































by histogram statistics. SIFT 0, HoG 0 and GIST [27] are delegates of such feature 
representation. Beyond such hand-craft feature descriptors, Bag-of-Feature (BoF) model 
seems to be the most classical image representation method in computer vision area. A 
lot of illuminating studies MBS were published to improve this traditional approach 
in different aspects. Among these extensions, a class of sparse coding based methods 
0 S, which employ spatial pyramid matching kernel (SPM) proposed by Lazebnik et 
al , has achieved great success in image classification problem. However, despite we are 
developing more and more effective representation methods, the lack of high-level image 
expression still plagues us to build up the ideal vision system. 

On the other hand, learning hierarchical models to simultaneously construct multiple 
levels of visual representation has been paid much attention recently. The proposed 
hierarchical image representation is partially motivated by recent developed deep learning 
approaches 13l 14, 20. Different from previous hand-craft feature design method, deep 
model learns the feature representation from raw data and validly generates the high- 
level semantic representation. And such abstract semantic representations are expected 
to provide more intra-class variability. Recently, many vision tasks achieve significant 
improvement using the convolutional architectures flfil . ll7L IT0 . A deep convolutional 
architecture consists of multiple stacked individual layers, followed by an empirical loss 
layer. Among all of these layers, the convolutional layer, the feature pooling layer and the 
full connection layer play major roles in abstract feature representation. The stochastic 
gradient descent algorithm is always applied to the parameters training in each layers 
according to back-propagation principle. However, as shown in recent study 120, these 
network-based hierarchical models always contain thousands of parameters. Learning 
a useful network usually depends on expertise of parameter tuning (e.g., tuning the 
learning rate and parameter decay rate in each layer ) and is too complex to control in 
real visual application. In contrast, we build up our hierarchical image representation 
according to the simple but effective rules. Our method can also achieve the near optimal 
classification rate in each layer. 

Another related work to this paper is learning a dictionary in an analysis prior [2fll . 
l30l . [dll . The key idea of analysis-based model is utilizing analysis operator (also known 
as analysis dictionary) to deal with latent clean signal and leading to a sparse outcome. 
In this paper, we consider the analysis-based prior as a regularization prior to learn more 
discriminative features to a certain category. Please refer to Sec[3]for more details about 
analysis dictionary learning. 


3. Background Overview 

3.1. Gentle Adaboost 

We start with a brief review of Gentle Adaboost algorithm [20. Without loss of 
generality, considering the two-class classification problem, let (xi,yi)...(xisr,V n) be the 
training samples, where Xi is a feature representation of the sample and yi £ {—1,1}- Wi 
is the sample weight related to x,;. Gentle Adaboost [20H] provides a simple additive 
model with the form, 

M 

F ( x i) = V 


(1) 






where f m is called weak classifier in the machine learning literature. It often defines f m as 
the regression stump f m {xi) = ah(xf > 5) +b, H(-) denotes the indicator function which 
returns 1 when xf > 5 and 0 otherwise, xf is the d-th dimension of the feature vector Xi, 
S is a threshold, a and b are two parameters contributing to the linear regression function. 
In iteration to, the algorithm learns the parameter (d, d, a, b) of / m (-) by weighted least- 
squares of yi to Xi with weight Wi, 

N 

min y^Wi || a d h(xf > 6 d ) + b d - yi || 2 , (2) 

l<d<D 

- - i=i 


where D is the dimension of the feature space. In order to give much attention to the 
cases that are misclassified in each round, Gentle Adaboost adjusts the sample weight in 
the next iteration as Wi <— Wie~ Vi ^ Xi > and updates F(xi) <— F(xi)+f m (xi). At last, the 
algorithm outputs the result of strong classifier as the form of sign function sign[F(xi)}. 
In this paper, we adopt Gentle Adaboost as the basic component of proposed model. 
Please refer to 20|, 321 for more technique details. 


3.2. Analysis Dictionary Learning 

Our work is also inspired by the recent developed analysis-based sparse representation 
prior learning [29113(1131| , which represents the input signal from a dual viewpoint of the 
commonly used synthesis model [331 ]. The main idea of analysis prior leaning is to learn 
the analysis operators (e.g., convolution operator) that can return the special responses 
(e.g., sparse response as usual) from the latent signal according to the given constraint. 
Let I be the observed signal (e.g., natural image) with noisy which is often assumed 
as zero-mean white Gaussian. An analysis-based prior seeks the latent signal / whose 
analysis transform result is sparse, 

mini||/-/||^+'0^(G*^), (3) 

where > 0 is a scalar constant and the symbol * indicates the analysis operation. 
The first term denotes the reconstruction error and the second one denotes the sparsity 
constraint of the forward transform coefficient. G is usually a redundant dictionary 
employing as the analysis operator. In different context, such analysis prior G is more 
frequently adopted to enforce some regularity on the signal. In this paper, we utilize 
the philosophy of analysis-based prior to seek the discriminative filters for image feature 
representation. Please refer to [2^, HIU] for more technique details and theoretical 
analysis. 


4. Problem Formulation 

Considering the two-class classification problem, for given training data and its cor¬ 
responding label {(xi,yi)\i G {1 ,... ,N}},yi G {—1,1}. In order to construct the rich 
and discriminative image representation for each category, we propose a deep boosting 
framework based on compositional feature selection and analysis dictionary learning. For 
a single layer, we firstly introduce the term of empirical error to the discriminative fea¬ 
tures mining. This is equal to learn the weak classifier in Gentle Adaboost algorithm. For 
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Figure 2: Toy example of constructing a three-level pyramid histogram as the image 
feature representation. 


each category, suppose that if we can find an analysis dictionary, denoted by G G R pxM , 
that the selected feature can be more suitable for such category by the analysis transfor¬ 
mation, then the feature representation would be more effective for visual recognition. 
Based on this idea, the fundamental of our single layer image representation is expressed 
as follows, 


Hg 11 \ 52 + X 52 W G *^W^ ( 4 ) 

i=1 

where Xi is the feature representation corresponding to image Ii and l(-) denotes the em¬ 
pirical error of the classifier, fl indicates positive training set and Ij ^ f l means that the 
image Ij does not belong to the set of positive samples. We define G = [<?i, <72 , ■■■9m—, 9 m] 
as the analysis dictionary and each g m indicates a linear filter. Thus G* I can be consid¬ 
ered as the a series of convolutional operations and the output is M feature maps, each 
of which is related to a special linear filter. The properties of our proposed model are 
two folds. On one hand, different from traditional analysis prior learning, we adopt the 
empirical error, which is more suitable for training the classifier, to replace the recon¬ 
struction error in Eq.©. On the other hand, the analysis operator is introduced as the 
regularized term to learn more discriminative features for each category. In the second 
term of Eq. O. we desire the analysis dictionary (i.e., a set of filters) has large filter re¬ 
sponse over the positive training set. In this way, the analysis dictionary learning process 
could discover category coherent features (i.e., one category one analysis dictionary) to 
promote the discriminative ability of weak classifiers. It is equivalent to make the anal¬ 
ysis dictionary has the small response over negative samples, thus we extract negative 
training samples and minimize the objective function to train the analysis dictionary. 
Note that, if the learned filter has the small response to both the positive and negative 
samples, the related feature representation will be eliminated in the further iteration of 
feature selection process. In this way, the discriminative of our image representation is 
enhanced by joint feature boosting and analysis dictionary learning, leading the model 


























































































more robust and compact as well. 

In Erp®, Xi is the feature vector of i -th image associated with the analysis trans¬ 
formation (z.e., Hlter response or convolution result). In order to obtain such feature 
representation, we employ the pyramid-wise histograms to quantize the filter responses, 
which provide some degree of translation invariance for the extracted features, as in hand¬ 
crafted features ( e.g ., SIFT or HoG), learned features (e.g., Bag-of-Visual-Words model), 
and average or maximum pooling process in convolution neural network. Suppose M is 
the total number of filters. Before construct the pyramid-wise histograms for a special 
image /, we firstly activate the maximum filter responses of each pixel and abandon the 
others as follows, 


_ J \\u m \\ if \\u m \\ = max{||ui||, ||u 2 ||,||wm||} 
m | 0 otherwise 

where u m indicates the ?n-th hlter response of pixel u £ I. 

According to the previous operation, we can obtain Ad feature maps for a training 
image, each of which has only a few locations being activated according to Eq.© (pre¬ 
sented by red solid circle in Fig. El. As shown in Fig 13 we apply a three-level spatial 
pyramid representation of each resulting feature map, resulting 1 + 2x2 + 4x4 = 21 
individual spatial blocks. We compute the histogram (with C bins, C = 50 in the rest 
of the paper) of the hlter responses in each block. Finally, we can get the “long” feature 
vector formed by concatenating the histograms of all blocks from all feature maps. The 
dimension of such feature vector is 21 x 50 x M. Note that M is not a constant scalar 
in this paper, and the value could be dynamically changed with the process of analysis 
dictionary learning. Please refer to Sec. 14.21 for more details. 

4-1- Feature Boosting 

In order to optimize the objective function in Eq.fl]), we propose a two-step optimizing 
strategy integrating the feature boosting and dictionary learning. In this subsection, we 
describe the details of feature boosting method by setting up the relationship between the 
weak classifier and the image feature representation. After the pyramid-wise histogram 
calculated, we select the discriminative features and obtain the single layer classiher 
through the given feature set. Follow the previous notation, let Xi £ R D be the feature 
representation of image where D is the dimension of the feature space and D = 
21 x 50 x M as described in the previous content. In the feature boosting phase, Gentle 
Adaboost is applied to the discriminative features (i.e., weak classifiers ) mining, which 
can separate the positive and negative samples nicely in each round. Note that in the 
rest of the paper, we apply xf to denote the value of Xi in the d -th dimension. In 
each round of feature boosting procedure, the algorithm retrieves all of the candidate 
regression functions {f 1 , f 2 ,..., f D }, each of which is formulated as: 

f d (xi) = a</)(xf -S) + b, (6) 

where </>(•) is the sigmoid function with the form <f>(x) = 1/(1 + e~ x ). For each round, 
the candidate function with minimum empirical error is selected as the current weak 


classifier /, such that 


N 

|| f d (xi) - y z || 2 , (7) 

2=1 

where f d {xi) is associated with the d-th element of Xi and the function parameter (<5, a, b). 
According to the above discussion, we build the bridge between the weak classifier and 
the feature representation, thus the weak classifiers learning can be viewed as the feature 
boosting procedure in our model. The feature boosting is usually terminated when the 
training error is converged. 

4-2. Analysis Dictionary Learning 

To the regularization perspective, another advantage of method is introducing analysis 
dictionary learning, which is conducted by selected features in the feature boosting phase, 
to emphasize the discriminative ability of analysis operator for the target category. In our 
framework, since we rely on discriminative filters to generate higher-layer proper analysis 
dictionary, we only consider to update a subset of filters which is corresponding to the 
selected features. We first need to construct the relationship between feature responses 
and filters. For any feature response, a four-item index is recorded as, 

[isActivited,w,h,g], (8) 

where isActivited indicates whether the feature response is selected in feature boosting 
stage, w, h are the horizontal and vertical coordinate in the image lattice domain re¬ 
spectively. g denotes the relative filter defined in Eq.(U). Then we apply the gradient 
descent algorithm to optimize filters which is corresponding to selected features. As Fig[T] 
illustrates, we combine any two optimized filters but not the features to generate filters 
in the next layer. In this way, the filter’s optimization in the next layer is independent 
with previous features. Note that in the first few layers, the number of filters is limited, 
thus almost every filter is taken into account in optimization. However, it will show in 
Sec l4.3l that the collection of compositional filters becomes large along with the architec¬ 
ture going deep, thus the screening mechanism is introduced to control the complexity 
and keep the effectiveness of the model. 

Integrating the two stages described in Sec. Id.II and Sec. 14.21 we achieve the feature 
boosting and analysis dictionary learning for the single layer. The algorithm is sum¬ 
marized in Alg|T| In next subsection we will introduce the filter combination rules to 
construct the hierarchical architecture of our model. 

4-3. Deep Boosting Framework 

In the context of boosting method, the strong classifier, which is usually the weighted 
linear combination of weak classifiers, is hardly to decease the test error when training 
error is approaching to zero. Based on this fact, it is our interest to learn high-level 
feature representations with more discriminative ability. In order to achieve this goal, 
we propose the filter combination rules and the output compositional filters of each layer 
are treated as a whole to generate the analysis dictionary in the next layer. 


(a) Illustration of compositional filters. 



(b) The similarity matrix of 2nd 
layer. 



(c) The similarity matrix of 3rd 
layer. 


Figure 3: Illustration of compositional filters for deep boosting. We composite filters in 
a pairwise manners in each layer and treat the output compositional filters as base filters 


(presented by solid circle in Fig. |3(a)~j ) in next layer. After combination, the similar matrix 
of filters is built up to drop out redundancies (presented by hollow circle in Fig. 3(a) I. 


For each image category, whose corresponding analysis dictionary in layer l is denoted 
by [G]i, we combine any two optimized filters (presented by solid circle in Fig. 3(a) I in 
the Z-th layer as follows, 


[g k ]i+i = <t>{ [gi\i + \gj]i ), (9) 

where (/>(•) is the sigmoid function. [gi]i and [gj\i indicate the i-th and j-th filters in the 
optimized subset of [G]j. As illustrated in Fig. |3(a)| the number of filters in each layer 
is quite different and we only adopt the optimized ones, which are related to selected 
features, to construct the image filters for the next layer. 

4-4- Model Compression Approach 

Although we carefully select filters for further combination, the number of com¬ 
positional filters will still be out of control when architecture going deep. Assuming 
there exists Mi optimized filters in layer Z, thus we can obtain he maximum number 















I x Mi x (Mi — 1) of compositional filters. In this way, the dimension of each image in 
the layer l + 1 would be | x tf; x (Mi — 1) x 21 x 50, which make the feature space is 
too complex and the training time becomes intolerable. To this end, we introduce model 
compression in the training phase. For any couple of filers, the L2 distance is calculated 
to measure the similarity between them. If the distance is smaller than the threshold 
S (set as 0.7 in all the experiment), we maintain the two filters are similar and one of 
them is dropped out randomly (presented by hollow circle in Fig. |3(a)[ ). Fig. |3(b)| and 
Fig. 3(c) illustrate the similarity matrix of filters in different layer. The intensity of every 
square indicates the similar degree of two filters. Please refer to Fig. [6] and Fig. [7] for 
more details about the classification accuracy and training time comparison with and 
without model compression for different depth of proposed framework. 

According to Sec. 14.31 we build up the hierarchical architecture of our deep boosting 
framework. In the testing phase, we employ the weak classifiers learned in every layer to 
produce the final classifier. The overall of our proposed method is summarized in Alg. [2] 


Algorithm 1 Joint Feature Boosting and Analysis Dictionary Learning 

Input: 

Positive and negative training samples (aq, yi)...(xN, Un), the number of selected fea¬ 
tures n. 

Output: 

A pool of selected features 'F, the learned dictionary G. 

Initialization: 

The dictionary G; 

Repeat 

1. Start with score F(x) = 0 and sample weights Wi = 1/A, i = 1,2,..., A. 

2. Select features and learn the strong classifier as follows: 

Repeat for m = 1, 2,..., II: 

(a) Learn the current weak classifier f m by Eq.©. 

(b) Update Wi X— and renormalize. 

(c) Update F(x) x— F(x) + f m (x). 

3. Update the dictionary G by gradient descent method. 

4. Generate new feature vectors of each image using G according to Sec. |4] 
until The objective function in Eq.fl]) converges. 


4-5. Preprocessing and Multi-class Decision 

At the beginning, we initialize the filters with the size of 5x5 adopting Gabor wavelets. 
Let I be an image defined on image lattice domain and G be the Gabor wavelet elements 
with parameters (w, h, a , s), where (w, h) is the central position belonging to the lattice 
domain, a and s denote the orientation and scale parameters. Different orientation and 
scale parameters makes Gabor wavelets variant. For simplicity, we apply 1 scale and 16 
orientations in our implementation, so there are total 16 filters at first layer. Notably, 
multi scales promote the performance while the filter combination process becomes com¬ 
plicated, because the combination is only allowed in the same scale. Followed by [34j , we 










Algorithm 2 Deep Boosting Framework 

Input: 

Positive and negative training images and corresponding labels (I±, z/i)...(7at, Un), the 
number of selected features 11; in layer l, the total layer number L. 

Output: 

The final classifier F L (x) for a special category. 

Initialization: 

Initialize G in first layer applying Gabor wavelets. 

Repeat for Z = 1, 2,..., L: 

1. Generate new feature x of image I using G according to Sec. [I] 

2. Boost features with dictionary learning according to Alg. [T] 

3. Build up filters of next layers according to Eq.©. 


utilize the normalize term to make the Gabor responses comparable during the inception 
phase between different training images: 

(!°) 

' ' a W,h 

where |P| is the total number of pixels in image 7, and A is the number of orientations. 
(•) denotes the convolution process. For each image 7, we normalize the local energy as 
|(I, G w h a S )| 2 /<5 2 (s) and define positive square root of such normalized result as feature 
response. 

To the multiclass situation, we consider the naive one-vs-all scheme to train multiple 
binary classifiers, each one learns to distinguish the samples in a single class from the 
samples in all remaining classes. Given the training data {(xi,yi)}fL 1 ,yi € {1,2 
we train K strong classifiers, each of which returns a classification score for a special test 
image. In the testing phase, we predict the label of image referring to the classifier with 
the maximum score. The reason why we adopt one-vs-all or OVA scheme throughout the 
paper is concentrated on two folds. On one hand, according to the Eq.fl]), we desire each 
learned analysis dictionary should have powerful capability to distinguish the images 
from one category. Thus we select the negative samples from all other categories to 
optimize the filters in Eq.((4]) (z.e., leaning the class-specific analysis dictionary) and this 
strategy is naturally consistent with the OVA scheme. On the other hand, as shown in 
(35j |. many multiclass models may not offer advantages over the simple OVA scheme in 
the solution of classification problem. Under such circumstances, we finally choose the 
OVA strategy followed by its intuitive concept. 


5. Experiment 

We conduct several experiments to investigate the properties of proposed deep boost¬ 
ing framework and evaluate the performance for different challenging visual recognition 
tasks (z.e., facial age estimation, natural image classification and similar appearance 
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Figure 4: The learned templates in the first four layers for each image categories. When 
the model goes deeper, we get higher level primitives and the more discriminative fea¬ 
tures. 


categories recognition). All of the experiments are carried out on a PC with Core i7- 
3960X 3.30 GHZ CPU and 24GB memory. In these tasks, we demonstrate superior or 
comparable performances of our framework over other state-of-the-art approaches. 

5.1. Learning image template for image categories 

In the first experiment we focus on whether our algorithm can learn and select mean¬ 
ingful and discriminative features for different image categories. Take CIFAR-10 dataset, 
for example. The CIFAR-10 dataset Q consists of 60K 32 x 32 color images in 10 classes 
(with 6K images per class), including airplane, automobile, bird, cat, deer, dog, frog, 
horse, ship and truck. We randomly select 1,000 images per class as the training samples 
to learn the hierarchical image representation. FigU shows some learned templates in 
different layers for each image categories. According to the visualizations, it is obviously 
that the higher layer it goes, the more informative features we gain. 

5.2. Natural image classification 

The same to CIFAR-10, the STL- 1(0 is also a ten-category image dataset, but with 
the image size 96 x 96. It has 1300 images per class. There are 500 training images and 
800 test images. The training set is mapped to ten predefined folds. Due to its relatively 
large image size, much prior research chose to downsample the images to 32 x 32. Tab. [I] 
shows the comparison of average test accuracies on all folds of STL-10. It is clear that our 
method can achieve very competitive results compared to other state-of-the-art methods. 


1 http://www.es. toronto.edu/~kriz/cifar.html 
2 http://cs.stanford.edu/~acoates/stllO/ 

























Figure 5: The empirical error at boosting rounds. The method with regularized term 
has better convergence rate. 
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Figure 6: Classification accuracy at different layers. The method conduct better perfor¬ 
mance with the growth of model. 


5.2.1. Impact of analysis dictionary learning 

In this section, we are interested in the performance of our method in context of 
analysis dictionary learning. As we mentioned above, the analysis operator is introduced 
as a regularized term to learn more discriminative features over the positive samples. 
We desire that the analysis dictionary is able to make the margin between positive and 
negative training sets as larger as possible. That is, the analysis dictionary has large 
response over the positive training set, but not vice versa. Note that, the related feature 
representation will be eliminated in the further iteration of feature selection process, if 
the learned filter responds a small value both to the negative set and to the positive 
set. In this way, we will gain more discriminative features in feature boosting procedure, 
resulting a more robust and compact image representation model. 

Tab. [2] shows the classification accuracy with and without regularized term. The 
result using regularized term outperforms the other and the standard deviation among 
folds is smaller, which illustrates that the feature is more discriminative and the model 













Table 1: Classification accuracy on STL-10. 


Method 

Accuracy (±er) 

1-layer Vector Quantization[36] 
1-layer Sparse Coding [36] 
3-layer Learned Receptive Field[37] 
OURS-5 

54.9% (± 0.4%) 
59.0% (± 0.8%) 
60.1% (± 1.0%) 
59.3% (± 0.8%) 



Figure 7: The average training time of categories at different layers. The average training 
time of categories greatly reduce when the model is compressed. 



is more robust. In Fig. [5J the empirical error in boosting phase is shown. For the more 
discriminative features, it is reasonable to accelerate convergence rate using regularized 
term. 

5.2.2. Impact of model depth and compression 

In this experiment, we perform classification experiments on the STL-10 in the con¬ 
text of different number of layers. We learn the deep boosting model to construct multiple 
levels of visual representation simultaneously. In order to construct high-level discrimina¬ 
tive representations, we composite the learned filters corresponding to selected features 
in the same layer, and feed the compositional results into next layer to build the higher- 
layer analysis dictionary. Hopefully when the model goes higher, the features is more 
discriminative. Fig. [S] exhibits the performance of image classification on STL-10 at dif¬ 
ferent layers. The results demonstrate that the features in higher layer conduct better 
performance. In order to avoid the sudden explosion of filters, we drop out similar filters 
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(a) The original images. 
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(b) The aligned and cropped images. 


Figure 9: The MORPH-II dataset. Four individuals in different races and genders are 
picked as an example. The ages are given around the images. 



Figure 10: Cumulative scores at different error levels on MORPH-II. 


randomly after pairwise combination of the learned filters. Although it losses accuracy 
slightly, we control the training time and make the limitless growth of model possible, 
which is illustrated in Fig. [G] and Fig. [3 

5.3. Similar appearance categories recognition 

The LHI-Animal-Faces dataseld [22 ) consists of about 2200 images for 20 categories. 
Fig. [8] provides an overview of the dataset. In contrast with other general classification 


3 http: / / www.stat.ucla.edu/~zzsi/hit / changelog.html 






















Table 2: Classification accuracy on STL-10 dataset with and without regularized term. 



Accuracy (±<r) 

with regularized term 
without regularized term 

59.3% (± 0.8%) 
55.8% (± 1.5%) 


datasets, LHI-Animal-Faces contains only animal or human faces, which are similar to 
each other. It is challenging to discern them for their evolutional relationship and shared 
parts. Besides, interesting within-class variation is shown in the face categories, including 
rotation, flip transforms, posture variation and sub-types. 

We compare our result with those reported in [381 obtained by other methods, which 


include HoG feature trained with SVM |6|, HIT [22|, AOT [38[ and partbased HoG 
feature trained with latent SVM j39j. In experiment, we splits the dataset as training set 
and test set following AOT [38j. For our method, we resize all the images to the uniform 
size of 60 x 60 pixels and the number of layers is 5. Tab. [3] exhibits the classification 
accurracy on LHI-Animal-Faces. It has shown that our method achieves a 2.4% increase, 
compared with the second best competitor. 

Table 3: Classification accuracy on LHI-Animal-Faces. 


Method 

Accuracy 

HoG+SVM 

70.8% 

HIT [22] 

75.6% 

LSVM[39] 

77.6% 

AOT [38] 

79.1% 

OURS-5 

81.5% 


5-4 ■ Facial age estimation 

Human age estimation based on facial images plays an important role in many appli¬ 
cations, e.g., intelligent advertisement, security surveillance monitoring and automatic 
face simulation. To our best knowledge, MORPH- 10 is the largest publicly available 
dataset for facial age estimation. In the MORPH-II dataset, there are more than 55, 000 
facial images from more than 13, 000 individuals with only about 4 labeled images per 
individual. The ages vary over a wide range from 16 to 77. The individuals come from 
different races, among them Africans accounted for about 77%, the Europeans about 
19%, and the remaining includes Hispanic, Asian and other races. Some sample images 


are shown in Fig. 9(a' 


We use two usually performance measures in our comparative study, i.e., MAE (Mean 
Absolute Error) and CumScore (Cumulative Score) Ho}. Suppose there are N test im¬ 
ages, the MAE is the sum of average absolute errors between the true ages ai and the 


4 http://www.faceaginggroup.com/morph/ 











predicted ages a,, i = 1,2, • • • ,1V. The MAE is calculated as, 

1 N 

MAE =—Y^\ai~ai\, (11) 

i 

where | • | denotes the absolute value of a scalar value. 

The CumScore is the cumulate accuracy rate. A certain error range [i.e., I years) 
is acceptable for many real applications. The cumulative score at error level l can be 
calculated as, 


CumScore(l) = N e ^i/N x 100%, (12) 

where N e <^i is the number of test images, which have absolute prediction error no more 
than l years. 

For an input image, we locate the face with bounding box and detect the five facial 
key points in the bounding box. The five facial key points include two eye centers, nose 
tip, and two mouth corners. Then we align the facial image based on these key points. 
Finally, the images are resized to the size of 60 x 60 pixels. The aligned images are shown 
in Fig. |9(b)| 

We compare our results with several exis ting algorithms designed for the age estima¬ 
tion, i.e., IIS-LLD [4ll| . WAS |4f| and AGES [431] . Moreover, we also conduct experiments 
using some feature descriptors usually used in face recognition, including Multi-level LBP 
[iij], HoG [g] and SIFT Q. For all of these features, age estimation is treated as classi¬ 
fication problem using multi-class SVMs. For our method, we set the number of layers 
to 2 and six-folder cross validation is performed. Tab. [I] summarizes the results based 
on the MAE measure. We can see that our method achieves better results compared to 
other state-of-the-art methods for age estimation. We also report the results in terms of 
the cumulative scores at different error levels from 0 to 10 in Fig. [10] exhibiting that our 
method outperforms other state-of-the-arts at almost all levels. 

Table 4: MAE (in Years) on MORPH-II (the lower the better). 


Method 

MAE 

MLBP+SVM 

6.85 

HoG+SVM 

6.19 

SIFT+SVM 

8.77 

WAS [42] 

9.21 

AGES [43] 

6.61 

IIS-LLD [41] 

5.67 

OURS-2 

5.61 


6. Conclusion 

In this paper, we propose a novel deep boosting framework, which is applied to 
construct the high-level discriminative features for general image recognition task. For 
each layer, the feature boosting and analysis dictionary learning are integrated into a 







unified framework for discriminative feature selection and learning. In order to construct 
high-level image representation, the combined filters in the same layer are fed into next 
layer to generate the novel analysis dictionary. The experiments in several benchmarks 
demonstrate the effectiveness of proposed method and achieve good performance on 
various visual recognition tasks. 


Acknowledgement 

This work was supported by the National Natural Science Foundation of China (no. 
61170193, no. 61370185), Guangdong Science and Technology Program (no. 2012B031500006), 
Guangdong Natural Science Foundation (no. S2012020011081, no. S2013010013432), 
Special Project on Integration of Industry, Education and Research of Guangdong Province 
(No. 2012B091000101), and Program of Guangzhou Zhujiang Star of Science and Tech¬ 
nology (No. 2013J2200067). Corresponding authors of this work is Liang Lin. 


References 

References 

[1] K. Grauman, T. Darrell, The pyramid match kernel: Discriminative classification with sets of image 
features, in: Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2, 
IEEE, 2005, pp. 1458-1465. 

[2] J. Winn, A. Criminisi, T. Minka, Object categorization by learned universal visual dictionary, in: 
Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2, IEEE, 2005, 
pp. 1800-1807. 

[3] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing 
natural scene categories, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer 
Society Conference on, Vol. 2, IEEE, 2006, pp. 2169-2178. 

[4] L. Fei-Fei, P. Perona, A bayesian hierarchical model for learning natural scene categories, in: Com¬ 
puter Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 
Vol. 2, IEEE, 2005, pp. 524-531. 

[5] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, International journal 
of computer vision 60 (1) (2004) 63-86. 

[6] N. Dalai, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision 
and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 
2005, pp. 886-893. 

[7] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image 
classification, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference 
on, IEEE, 2009, pp. 1794-1801. 

[8] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image 
classification, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 
IEEE, 2010, pp. 3360-3367. 

[9] X. Zhou, K. Yu, T. Zhang, T. S. Huang, Image classification using super-vector coding of local 
image descriptors, in: Computer Vision-ECCV 2010, Springer, 2010, pp. 141-154. 

[10] Z. Jiang, Z. Lin, L. S. Davis, Label consistent k-svd: Learning a discriminative dictionary for 
recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (11) (2013) 2651 
2664. 

[11] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, F. R. Bach, Supervised dictionary learning, in: 
Advances in neural information processing systems, 2009, pp. 1033—1040. 

[12] M. Yang, L. Zhang, X. Feng, D. Zhang, Sparse representation based fisher discrimination dictionary 
learning for image classification, International Journal of Computer Vision 109 (3) (2014) 209—232. 

[13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recogni¬ 
tion, Proceedings of the IEEE 86 (11) (1998) 2278-2324. 



[14] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural 
computation 18 (7) (2006) 1527-1554. 

[15] H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional deep belief networks for scalable unsu¬ 
pervised learning of hierarchical representations, in: Proceedings of the 26th Annual International 
Conference on Machine Learning, ACM, 2009, pp. 609-616. 

[16] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, 
Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (1) (2013) 221—231. 

[17] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection 
and semantic segmentation, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE 
Conference on, IEEE, 2014, pp. 580-587. 

[18] S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison for 
person re-identification, Pattern Recognition 48 (10) (2015) 2993-3003. 

[19] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Computer 
Vision-ECCV 2014, Springer, 2014, pp. 818-833. 

[20] J. Friedman, T. Hastie, R. Tibshirani, et al., Additive logistic regression: a statistical view of 
boosting (with discussion and a rejoinder by the authors), The annals of statistics 28 (2) (2000) 
337-407. 

[21] Y. Rui, T. S. Huang, S.-F. Chang, Image retrieval: Current techniques, promising directions, and 
open issues, Journal of visual communication and image representation 10 (1) (1999) 39-62. 

[22] Z. Si, S.-C. Zhu, Learning hybrid image templates (hit) by information projection, Pattern Analysis 
and Machine Intelligence, IEEE Transactions on 34 (7) (2012) 1354-1367. 

[23] J. Canny, A computational approach to edge detection, Pattern Analysis and Machine Intelligence, 
IEEE Transactions on (6) (1986) 679-698. 

[24] B. A. Olshausen, et al., Emergence of simple-cell receptive field properties by learning a sparse code 
for natural images, Nature 381 (6583) (1996) 607-609. 

[25] L. Lin, X. Wang, W. Yang, J.-H. Lai, Discriminatively trained and-or graph models for object shape 
detection, Pattern Analysis and Machine Intelligence, IEEE Transactions on 37 (5) (2015) 959-972. 

[26] P. Luo, L. Lin, X. Liu, Learning compositional shape models of multiple distance metrics by infor¬ 
mation projection, Neural Networks and Learning Systems, IEEE Transactions on. 

[27] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial 
envelope, International journal of computer vision 42 (3) (2001) 145-175. 

[28] P. Luo, X. Wang, X. Tang, A deep sum-product architecture for robust facial attributes analysis, 
in: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 2013, pp. 2864-2871. 

[291 M. Elad, P. Milanfar, R. Rubinstein, Analysis versus synthesis in signal priors, Inverse problems 
23 (3) (2007) 947. 

[30] P. Sprechmann, R. Litman, T. B. Yakar, A. M. Bronstein, G. Sapiro, Supervised sparse analysis and 
synthesis operators, in: Advances in Neural Information Processing Systems, 2013, pp. 908—916. 

[31] R. Rubinstein, T. Peleg, M. Elad, Analysis k-svd: A dictionary-learning algorithm for the analysis 
sparse model, Signal Processing, IEEE Transactions on 61 (3) (2013) 661-677. 

[32] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing features: efficient boosting procedures for 
multiclass object detection, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. 
Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2, IEEE, 2004, pp. 762-769. 

[33] S. Gu, L. Zhang, W. Zuo, X. Feng, Projective dictionary pair learning for pattern classification, in: 
Advances in Neural Information Processing Systems, 2014, pp. 793-801. 

[34] Y. N. Wu, Z. Si, H. Gong, S.-C. Zhu, Learning active basis model for object detection and recog¬ 
nition, International journal of computer vision 90 (2) (2010) 198-235. 

[35] R. Rifkin, A. Klautau, In defense of one-vs-all classification, The Journal of Machine Learning 
Research 5 (2004) 101-141. 

[36] A. Coates, A. Y. Ng, The importance of encoding versus training with sparse coding and vector 
quantization, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 
2011, pp. 921-928. 

[37] A. Coates, A. Y. Ng, Selecting receptive fields in deep networks, in: Advances in Neural Information 
Processing Systems, 2011, pp. 2528-2536. 

[38] Z. Si, S.-C. Zhu, Learning and-or templates for object recognition and detection, Pattern Analysis 
and Machine Intelligence, IEEE Transactions on 35 (9) (2013) 2189-2205. 

[39] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discrimi¬ 
natively trained part-based models, Pattern Analysis and Machine Intelligence, IEEE Transactions 
on 32 (9) (2010) 1627-1645. 

[40] K. Smith-Miles, X. Geng, Z.-H. Zhou, Correction to” automatic age estimation based on facial 



aging patterns”, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2) (2008) 
0368. 

[41] X. Geng, C. Yin, Z.-H. Zhou, Facial age estimation by learning from label distributions, Pattern 
Analysis and Machine Intelligence, IEEE Transactions on 35 (10) (2013) 2401-2412. 

[42] A. Lanitis, C. J. Taylor, T. F. Cootes, Toward automatic simulation of aging effects on face images, 
Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (4) (2002) 442-455. 

[43] X. Geng, Z.-H. Zhou, K. Smith-Miles, Automatic age estimation based on facial aging patterns, 
Pattern Analysis and Machine Intelligence, IEEE Transactions on 29 (12) (2007) 2234-2240. 

[44] D. T. Nguyen, S. R. Cho, K. R. Park, Human age estimation based on multi-level local binary 
pattern and regression method, in: Future Information Technology, Springer, 2014, pp. 433-438. 



